Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in SwitchTraffic that wasn't respecting --dry_run for readonly and replica tablets during a resharding event #12992

Conversation

austenLacy
Copy link
Contributor

@austenLacy austenLacy commented Apr 27, 2023

Description

There's a bug when resharding and switching traffic that makes it so it does not respect the --dry_run flag.

❌ does not respect --dry_run without the fix in traffic_switcher.go

Note the no locksInfo error

vtctlclient Reshard -- --tablet_types=rdonly,replica SwitchTraffic --dry_run customer.cust2cust

I0426 14:42:58.221613    9828 main.go:96] I0426 14:42:58.221425 traffic_switcher.go:399] About to switchShardReads: [], [RDONLY REPLICA], 0
E0426 14:42:58.223739    9828 main.go:96] E0426 14:42:58.223117 traffic_switcher.go:401] switchShardReads failed: Code: INVALID_ARGUMENT
keyspace customer is not locked (no locksInfo)
E0426 14:42:58.224562    9828 main.go:96] E0426 14:42:58.223567 vtctl.go:2264] 
keyspace customer is not locked (no locksInfo)
I0426 14:42:58.236510    9828 main.go:96] I0426 14:42:58.236337 vtctl.go:2266] Workflow Status: Reads Not Switched. Writes Not Switched

Following vreplication streams are running for workflow customer.cust2cust:

id=1 on -80/zone1-0000000301: Status: Running. VStream Lag: 0s.
id=1 on 80-/zone1-0000000400: Status: Running. VStream Lag: 0s.

Reshard Error: rpc error: code = Unknown desc = keyspace customer is not locked (no locksInfo)
E0426 14:42:58.251351    9828 main.go:105] remote error: rpc error: code = Unknown desc = keyspace customer is not locked (no locksInfo)

✅ does respect --dry_run with the fix in traffic_switcher.go

Note no more locksInfo error

vtctlclient Reshard -- --tablet_types=rdonly,replica SwitchTraffic --dry_run customer.cust2cust
...
Dry Run results for SwitchTraffic run at 26 Apr 23 15:14 UTC
Parameters: --tablet_types=rdonly,replica SwitchTraffic --dry_run customer.cust2cust

Lock keyspace customer
Switch reads from keyspace customer to keyspace customer for shards 0 to shards -80,80-
Unlock keyspace customer

testing on the primary

On v15 I wasn't able to replicate the issue with the primary tablets going to NOT SERVING because SwitchTraffic did respect the dry run flag when dealing with the primary.

# without the fix
vtctlclient Reshard -- --tablet_types=primary SwitchTraffic --dry_run customer.cust2cust
...
Dry Run results for SwitchTraffic run at 26 Apr 23 15:15 UTC
Parameters: --tablet_types=primary SwitchTraffic --dry_run customer.cust2cust

Lock keyspace customer
Stop writes on keyspace customer, tables [/.*]:
        Keyspace customer, Shard 0 at Position MySQL56/83376623-e444-11ed-87cb-4201f0803195:1-151
Wait for VReplication on stopped streams to catchup for upto 30s
Create reverse replication workflow cust2cust_reverse
Create journal entries on source databases
Enable writes on keyspace customer tables [/.*]
Switch routing from keyspace customer to keyspace customer
IsPrimaryServing will be set to false for:
        Shard 0, Tablet 200
IsPrimaryServing will be set to true for:
        Shard -80, Tablet 301
        Shard 80-, Tablet 400
SwitchWrites completed, freeze and delete vreplication streams on:
        tablet 301
        tablet 400
Start reverse replication streams on:
        tablet 200
Mark vreplication streams frozen on:
        Keyspace customer, Shard -80, Tablet 301, Workflow cust2cust, DbName vt_customer
        Keyspace customer, Shard 80-, Tablet 400, Workflow cust2cust, DbName vt_customer
Unlock keyspace customer

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

None

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Apr 27, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Apr 27, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@github-actions github-actions bot added this to the v17.0.0 milestone Apr 27, 2023
@mattlord mattlord self-assigned this Apr 27, 2023
@mattlord mattlord added Type: Bug Component: VReplication and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Apr 27, 2023
rohit-nayak-ps added a commit to planetscale/vitess that referenced this pull request Apr 28, 2023
…icSwitch dry run for reads during resharding tries to actually switch reads and fails

Signed-off-by: Rohit Nayak <[email protected]>
@rohit-nayak-ps
Copy link
Contributor

Thanks @austenLacy for the bug report and fix! Can you please fix the DCO for your commit so it passes CI?

Also I updated the e2e tests to reproduce this failure and confirm your fix, at 0567da2. Can you also cherry-pick that commit into your PR. Previously dry-run for switching read traffic was only tested for MoveTables, I added it for Reshards as well.

austenLacy and others added 2 commits April 28, 2023 09:21
…icSwitch dry run for reads during resharding tries to actually switch reads and fails

Signed-off-by: Rohit Nayak <[email protected]>
@austenLacy austenLacy force-pushed the austenlacy/respect-dry-run-flag-on-reshard-switchtraffic branch from 8b842bb to 99540f4 Compare April 28, 2023 13:22
@austenLacy
Copy link
Contributor Author

Thanks for the e2e test @rohit-nayak-ps. Just signed off my original commit and cherry picked yours in.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @austenLacy ! ❤️

@mattlord mattlord merged commit dc5ec10 into vitessio:main Apr 28, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Apr 28, 2023

I was unable to backport this Pull Request to the following branches: release-14.0, release-15.0, release-16.0.

maksimov pushed a commit to slackhq/vitess that referenced this pull request Sep 9, 2024
…donly and replica tablets during a resharding event (vitessio#12992)

* use switcher struct when switching shard reads during a reshard event

Signed-off-by: austenLacy <[email protected]>

* Create failing test for bug reported in vitessio#12992, where a TrafficSwitch dry run for reads during resharding tries to actually switch reads and fails

Signed-off-by: Rohit Nayak <[email protected]>

---------

Signed-off-by: austenLacy <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Co-authored-by: Rohit Nayak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants